knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(here)
library(plotly)
library(scales)
library(kableExtra)
library(scales)Workflow
- Gather all required input files (vcf and beds)
- Create bed files of positions of interest by taking chromosome, start, and end positions, then creating a new column where those are concatenated.
- Use bedtools slop to extend the start and end columns by however many bases you need (value is 1/2 total pad value)
- Use coveragebed (wrapper in snakemake) to calculate bases covered at positions
- Use bcftools annotate to annotate the vcfs with the positions of interest information
- Use bcftools query to pull out rows with positions of interest marks
- Analyze and visualize results in R.
Diagram of Snakemake rules and workflow
via
snakemake --dag | dot -Tsvg > dag.svg
Data File Collection
STEP 1
Collected necessary files. Per Nate I had to retrieve the following sets of files:
- v4.2.1 HG002
- v4.2.1 HG003
- v0.6 HG002
- v1.0.0 CMRG (small variants and structural variants) See the resources/README for details.
STEP 2
Generated bed file of positions of interest. From the “Larger Variants list to explore for Cergentis_06.21.21” excel file provided by Samantha Maragh, I took the first four columns (Chr, hg19 start, hg19 end, hg38 start, hg38 end) and moved it into a new file “poi.tsv”. Samantha added 3 new positions based on the Genomic Vision list: - 2 49304890 49316235 49532029 49543374 - 6 76727407 76749167 77437124 77458884 - 7 143127742 143196806 142824835 142893899
This resulted in a total of 22 positions of interest. For any insertions - there are no end coordiantes. Per Nate, just take the start and add +1 base. This was done for each genome, the headers removed, and placed into the two files: hg19_poi.bed & hg38_poi.bed
NOTE for deletions where there was no end coordinate, I made the end coordinate +1 start coordinate. I also added a column to each bed file where I concatenated the chrom_start_end as unique variant IDs (will be column VOI later)
These are all in the data/ directory. See data/README for details.
STEP 3
Nate & I determined I would try to do the work via the command line first, manually, then adapt it to Snakemake. This work is contained in the Commands section of this document.
Commands
Determining which flags need to be used, resolving any issues locally before porting over the process to Snakemake. Starting with GRCh37 50 kb to test.
Bedtools slop
Snakemake will use this wrapper: https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/bedtools/slop.html
The code in the wrapper:
__author__ = "Jan Forster"
__copyright__ = "Copyright 2019, Jan Forster"
__email__ = "j.forster@dkfz.de"
__license__ = "MIT"
from snakemake.shell import shell
## Extract arguments
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"(bedtools slop"
" {extra}"
" -i {snakemake.input[0]}"
" -g {snakemake.params.genome}"
" > {snakemake.output})"
" {log}"
)
So command line here would be:
Add 50kb
bedtools slop -b 25000 -i cergentis_variant_surveillance-master/data/hg19_poi.bed -g cergentis_variant_surveillance-master/data/hg19_genome.txt > hg19_poi_50kb_slop.bed
-b 25000 adds 25kb on either side of the position of interest, totalling 50 kb padding.
Bedtools coverage
Will use snakemake wrapper: https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/bedtools/coveragebed.html Wrapper code:
__author__ = "Patrik Smeds"
__copyright__ = "Copyright 2019, Patrik Smeds"
__email__ = "patrik.smeds@gmail.com"
__license__ = "MIT"
from snakemake.shell import shell
shell.executable("bash")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)
extra_params = snakemake.params.get("extra", "")
input_a = snakemake.input.a
input_b = snakemake.input.b
output_file = snakemake.output[0]
if not isinstance(output_file, str) and len(snakemake.output) != 1:
raise ValueError("Output should be one file: " + str(output_file) + "!")
shell(
"coverageBed"
" -a {input_a}"
" -b {input_b}"
" {extra_params}"
" > {output_file}"
" {log}"
)
Requires bam but Nate’s MRG snakefile uses a bed file:
rule calc_gene_coverage:
input:
a=ensembl_dir + "/{ref}_mrg_full_{region}.bed",
b=benchdir + "/HG002_{ref}_{benchmarkset}_{benchtype}.bed"
output: "workflow/results/cov_tbls/HG002_{ref}_{benchmarkset}_{benchtype}_{region}_cov.tsv"
threads: 2
wrapper: "0.74.0/bio/bedtools/coveragebed"
50kb - GRCh37 benchmark bed
coverageBed -a /Users/sdm8/Desktop/test_output/hg19_poi_50kb_slop.bed -b /Users/sdm8/Desktop/cergentis_variant_surveillance-master/resources/HG002_GRCh37_1_22_v4.2.1_benchmark_noinconsistent.bed > ~/Desktop/test_output/hg19_50kb_GRCh37benchmark_coverage.bed
bcftools filter
Filter the vcf file according to the regions defined in the positions of interest bed file.
bcftools filter /Users/sdm8/Desktop/cergentis_variant_surveillance-master/resources/HG002_GRCh37_1_22_v4.2.1_benchmark.vcf.gz -R /Users/sdm8/Desktop/test_output/hg19_poi_50kb_slop.bed > ~/Desktop/test_output/HG002_GRCh37_1_22_v4.2.1_benchmark_filtered.vcf
bcftools annotate
Make a header text file to define what the header text will be.
echo '##INFO=<ID=VOI,Number=1,Type=String,Description="Variant of Interest ID">' > ~/Desktop/header.txt
Annotate the sorted bed file.
bcftools annotate -a /Users/sdm8/Desktop/test_output/hg19_poi_50kb_slop.bed -c 'CHROM,FROM,TO,INFO/VOI' -h ~/Desktop/test_output/header.txt ~/Desktop/test_output/HG002_GRCh37_1_22_v4.2.1_benchmark_filtered.vcf > ~/Desktop/test_output/HG002_GRCh37_1_22_v4.2.1_benchmark_filtered_annotated.vcf
bcftools query
bcftools query -f '%CHROM\t%POS\t[ %GT]\t%TYPE\t%INFO/VOI\t%REF\t%ALT\n' ~/Desktop/test_output/HG002_GRCh37_1_22_v4.2.1_benchmark_filtered_annotated.vcf > ~/Desktop/HG002_GRCh37_1_22_v4.2.1_benchmark_filtered_annotated.tsv
Run snakemake:
conda activate snakemake-tutorial
snakemake --cores 1 --use-conda
(Had to remove the filter step - could not get it to work. will instead filter in R)
Query coverage bed file cols:
n_regions is the number of regions in the bed file overlapping the gene. For your work, this is the number of region (rows of the benchmark regions bed file) that overlap with the window around a variant of interest).
bp_cov is the number of bases that are covered by the benchmark regions. In your case this is the number of bases in the 50 or 100 kb window around a variant of interest that is included in the benchmark regions.
cov is the fraction of bases covered
## read in snakemake file and display
cat(readLines('~/Desktop/cergentis_variant_surveillance/workflow/snakefile'), sep = '\n')## from snakemake.utils import min_version
##
## ## set minimum snakemake version
## min_version("6.4.0")
##
## ###########################
## ## DEFINE WILDCARDS
## ###########################
##
## SLOPS = ["50000"]
## BENCHSETS = ["v4_smallvar", "v0.6.2_SVs-Tier1-noVDJorXorY"]
## GIABIDS = ["HG002", "HG003"]
##
## ## Add wildcard constraints
## wildcard_constraints:
## slop="|".join(SLOPS),
## giab_id="|".join(GIABIDS),
## benchset="|".join(BENCHSETS)
##
## ###########################
## ## RULES
## ###########################
##
## ## define all important and final outputs
## ## tells snakemake to do all rules that make these files
## rule all:
## input:
## expand("results/query/{giab_id}_GRCh37_v4_smallvar_{slop}_anno.tsv",
## giab_id = GIABIDS, slop = SLOPS),
## expand("results/query/HG002_GRCh37_v0.6.2_SVs-Tier1-noVDJorXorY_{slop}_anno.tsv",
## slop = SLOPS),
## expand("results/coveragebed/poi_{slop}_HG002_GRCh37_v0.6.2_SVs-Tier1-noVDJorXorY_cov.bed",
## slop = SLOPS),
## expand("results/coveragebed/poi_{slop}_{giab_id}_GRCh37_v4_smallvar_cov.bed",
## giab_id = GIABIDS, slop = SLOPS)
##
##
## ## Prepare and Rename input benchmarkset files
## rule prepare_v4:
## input:
## v4_smallvar_vcf = "resources/{giab_id}_GRCh37_1_22_v4.2.1_benchmark.vcf.gz",
## v4_smallvar_bed = "resources/{giab_id}_GRCh37_1_22_v4.2.1_benchmark_noinconsistent.bed"
## output:
## v4_smallvar_vcf = "workflow/data/benchmark_sets/{giab_id}_GRCh37_{benchset}.vcf.gz",
## v4_smallvar_bed = "workflow/data/benchmark_sets/{giab_id}_GRCh37_{benchset}.bed"
## shell: """
## cp {input.v4_smallvar_vcf} {output.v4_smallvar_vcf}
## cp {input.v4_smallvar_bed} {output.v4_smallvar_bed}
## """
##
## rule prepare_sv:
## input:
## sv_tier1_vcf = "resources/{giab_id}_v0.6.2_SVs-Tier1-noVDJorXorY.vcf.gz",
## sv_tier1_bed = "resources/{giab_id}_v0.6.2_SVs-Tier1-noVDJorXorY.bed"
## output:
## sv_tier1_vcf = "workflow/data/benchmark_sets/{giab_id}_GRCh37_{benchset}.vcf.gz",
## sv_tier1_bed = "workflow/data/benchmark_sets/{giab_id}_GRCh37_{benchset}.bed"
## shell: """
## cp {input.vcf} {output.vcf}
## cp {input.bed} {output.bed}
## """
##
##
## ## BEDTOOLS SORT
## ## sort manually generated poi bed
## rule bedtools_sort:
## input:
## in_file="data/GRCh37_poi.bed",
## genome="data/GRCh37_genome.txt"
## output:"results/sorted/GRCh37_poi_sorted.bed"
## wrapper: "0.77.0/bio/bedtools/sort"
##
##
## ## BEDTOOLS SLOP
## ## add 50 and 100 kb surrounding positions of interest
## rule bedtools_pad:
## input: "results/sorted/GRCh37_poi_sorted.bed"
## output: "results/slop/GRCh37_poi_{slop}.bed"
## params:
## genome="data/GRCh37_genome.txt",
## ## Add optional parameters
## extra = "-b {slop}"
## log: "logs/slop/GRCh37_poi_{slop}.bed.log"
## wrapper: "v0.75.0/bio/bedtools/slop"
##
## ## BEDTOOLS COVERAGE
## ## Calulate number of included bases for each benchmark set in the 50 and 100 kb windows
## rule coverageBed:
## input:
## a="results/slop/GRCh37_poi_{slop}.bed",
## b="workflow/data/benchmark_sets/{giab_id}_GRCh37_{benchset}.bed"
## output: "results/coveragebed/poi_{slop}_{giab_id}_GRCh37_{benchset}_cov.bed"
## log: "logs/coveragebed/poi_{slop}_{giab_id}_GRCh37_{benchset}_cov.log"
## threads: 6
## wrapper: "0.76.0/bio/bedtools/coveragebed"
##
##
## ## BCFTOOLS ANNOTATE
## ## make header file for annotate_vcf rule
## rule make_hdr_file:
## output: "workflow/data/header.txt"
## shell: """
## echo '##INFO=<ID=VOI,Number=1,Type=String,Description="Variant of Interest ID">' > {output}
## """
##
##
## rule annotate_vcf:
## input:
## bed="results/slop/GRCh37_poi_{slop}.bed",
## header="workflow/data/header.txt",
## vcf="workflow/data/benchmark_sets/{giab_id}_GRCh37_{benchset}.vcf.gz"
## output:"results/annotate/{giab_id}_GRCh37_{benchset}_{slop}_anno.vcf"
## conda: "envs/bcftools.yml"
## shell: """
## bcftools annotate \
## -a {input.bed} \
## -c CHROM,FROM,TO,INFO/VOI \
## -h {input.header} \
## {input.vcf} \
## > {output}
## """
##
##
## ## BCFTOOLS QUERY
## ## Make variant table
## rule make_smallvar_tbl:
## input:"results/annotate/{giab_id}_GRCh37_{benchset}_{slop}_anno.vcf"
## output:"results/query/{giab_id}_GRCh37_{benchset}_{slop}_anno.tsv"
## conda:"envs/bcftools.yml"
## shell: """
## bcftools query \
## -f '%CHROM\\t%POS\\t[ %GT]\\t%TYPE\\t%INFO/VOI\\t%REF\\t%ALT\\n' \
## {input} \
## > {output}
## """
Analysis
Run Notes
We used only GRCh37 so we could include the SV input files (only performed on GRCh37). Final positions will be converted to GRCh38.
We used only the small variant and SV (excluding the CMRG) as input.
All end posiitons of the GRCh38 poi bed file was start+1 to maintain a constant 50kb window (the extra 1 base is regarded as okay).
Objective:
Catalog all variants (both small and structural variants) within 50 kb of structural variants previously identified for the GEC mixture study.
Approach:
- Make a summary df with all of the result tsv files (add col for ID)
- Filter based on INFO/VOI tag in INFO columns (those without = outside of poi range)
- Summarise: At each poi: the number of variants, size distribution, type of variants
ID the poi’s with the widest range of variants (#, type diversity, size range)
Positions of Interest
GRCh37
The end positions were all start+1 so as to keep the padding in a 50kb window at all times (excluding those positions whose variants were large extending the end position nearing or exceeding 50kb).
(grch37_poi_bed <- read.table(here("data/GRCh37_poi.bed"),header = FALSE, sep="\t",stringsAsFactors=FALSE, quote="") %>%
rename(chromosome = V1, start_coord = V2, end_coord = V3, chrom_start_end = V4)) ## chromosome start_coord end_coord chrom_start_end
## 1 2 206241116 206241117 2_206241116_206241117
## 2 2 49561482 49561483 2_49561482_49561483
## 3 3 127859142 127859143 3_127859142_127859143
## 4 3 162512133 162512134 3_162512133_162512134
## 5 4 103764056 103764057 4_103764056_103764057
## 6 5 117868224 117868225 5_117868224_117868225
## 7 6 156081224 156081225 6_156081224_156081225
## 8 7 16171773 16171774 7_16171773_16171774
## 9 9 10163173 10163174 9_10163173_10163174
## 10 10 66299841 66299842 10_66299841_66299842
## 11 10 81695836 81695837 10_81695836_81695837
## 12 10 82524373 82524374 10_82524373_82524374
## 13 12 53499604 53499605 12_53499604_53499605
## 14 12 38824344 38824345 12_38824344_38824345
## 15 13 90154843 90154844 13_90154843_90154844
## 16 13 42223740 42223741 13_42223740_42223741
## 17 14 105149647 105149648 14_105149647_105149648
## 18 15 79844900 79844901 15_79844900_79844901
## 19 16 82791081 82791082 16_82791081_82791082
## 20 2 49532029 49532030 2_49532029_49532030
## 21 6 77437124 77437125 6_77437124_77437125
## 22 7 142824835 142824836 7_142824835_142824836
Load & Munge Data
Read in the Snakemake results files and organize into annotated data frames.
##################### QUERY RESULT FILES
poi_defns <- read_tsv(file=here("data/GRCh37_POI_definitions.tsv"), col_names=FALSE) %>%
rename(chromosome = X1, start = X2, end = X3, VOI = X4, initial_var_type = X5, initial_var_size = X6) %>%
select(-chromosome, -start, -end)
## get list of files in results directory
file_list <- as.list(list.files(path = here("results/query"),
pattern = "anno.tsv",
include.dirs = TRUE, full.names = TRUE,
recursive = TRUE))
## strip the path portion of the file names
file_names <- str_remove(file_list, paste0(here("results/query"),"/"))
## organize the file list to read them into a df
df_lst <- set_names(file_list, file_names)
## make that into a data frame
metadata_df <- tibble(query_file = unlist(file_names)) %>%
separate(query_file, c("giab_id","ref_genome", "benchset", "bench-type", "pad_size", "temp"),
sep = "_", remove = FALSE) %>%
select(-temp)
## read in files and create 1 df
cergentis_df <- df_lst %>%
map_dfr(read_tsv, .id = "query_file", col_names = c("chrom", "pos", "GT", "var_type", "VOI", "ref", "alt")) %>%
## filter out regions outside of poi (no VOI entry)
filter(!VOI == ".")
## Annotating with sample metadata
cergentis_anno_df <- cergentis_df %>%
left_join(metadata_df) %>%
left_join(poi_defns, by = "VOI")
##################### COVERAGE RESULT FILES
## get files in a list to read in
cov_files <- list.files(path = here("results/coveragebed/"),
pattern = ".bed") %>%
set_names(~ str_remove(., "_cov.bed")) %>%
set_names(~ str_remove(., "poi_")) %>%
map(~ here("results/coveragebed/", .))
## build df
cov_df <- cov_files %>%
## read in files
map_dfr(
read_tsv,
col_names = c(
"chrom",
"start",
"end",
"chrom_start_end",
"n_regions",
"bp_cov",
"gene_size",
"cov"),
col_types = "ciiciiid",
.id = "covset"
) %>%
## separate the file name into columns with metadata
separate(
col = covset,
into = c("giab_id", "ref", "benchmark", "bench_type", "region"),
sep = "_",
remove = TRUE
)HG002
Visualize & Summarize Variants
Prepare Data
Filter data and organize into dataframes for visualization.
# filter to include only HG002
HG002_cergentis_anno_df <- cergentis_anno_df %>%
filter(giab_id == "HG002")
################# CALCULATE VARIANT SIZE
## Calculating INDEL Variant Size
get_var_size <- function(ref, alt){
ref_bp <- nchar(ref)
## Dealing with multiple ALTs
# --- using largest SV
if(str_detect(alt, ",")){
alts <- unlist(str_split(alt, ","))
alts_bp <- nchar(alts)
var_sizes <- alts_bp - ref_bp
var_size <- var_sizes[abs(var_sizes) == max(abs(var_sizes))]
## returning first entry value if the two genotypes are the same size
return(var_size[1])
}
return(nchar(alt) - ref_bp)
}
## Getting variant size into df
HG002_sized_var_df <- HG002_cergentis_anno_df %>%
rowwise() %>%
mutate(var_size = get_var_size(ref, alt))
################## COUNTS VARIANTS PER POI
# count how many variants are within each poi
HG002_num_vars <- HG002_sized_var_df %>%
group_by(VOI, pad_size) %>%
tally() %>%
rename(num_vars = n)
# count how many of each type of variant is within each poi
HG002_type_vars <- HG002_sized_var_df %>%
group_by(VOI, var_type, pad_size) %>%
tally() %>%
pivot_wider(names_from = var_type, values_from = n, values_fill = 0) %>%
rename(num_SNP = SNP, num_INDEL = INDEL, num_SNP_INDEL = 'SNP,INDEL')
# df where variant type counts are summarized
HG002_var_counts_summary_df <- HG002_num_vars %>%
left_join(HG002_type_vars, by = "VOI") %>%
select(-pad_size.x, -pad_size.y)
# the sum of counts of types of vars should = num of vars total
HG002_sanity_check <- HG002_var_counts_summary_df %>%
rowwise() %>%
mutate(total = sum(num_SNP,num_INDEL,num_SNP_INDEL, na.rm = T)) %>%
mutate(check = (num_vars == total)) %>%
filter(check != TRUE) %>%
nrow(.) == 0
# add the var count and types into the var size df
HG002_variant_df <- HG002_sized_var_df %>%
left_join(HG002_num_vars, by = "VOI") %>%
left_join(HG002_type_vars, by = "VOI")%>%
mutate(coord = paste(chrom, pos, sep="_"))
write.table(HG002_variant_df, file = here("results/analysis/HG002_variant_df.tsv"), sep = "\t", row.names = FALSE)Variant Type per POI
Per position of interest (POI) how many of each type of variant exists across the reference benchmarks within 50 kb on either side?
HG002_var_counts_summary_plot <- HG002_var_counts_summary_df %>%
select(-num_vars) %>%
rename(SNP = num_SNP, INDEL = num_INDEL, SNP_INDEL = num_SNP_INDEL) %>%
pivot_longer(names_to = "var_type", values_to = "count",
cols = c("SNP", "INDEL", "SNP_INDEL"))
# color palette definition
cbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# bar plot of counts per variant type
ggplotly(ggplot(HG002_var_counts_summary_plot) +
geom_bar(aes(x = VOI, y = count, fill = var_type),
position = position_dodge(preserve = "single"), stat = "identity") +
scale_y_continuous(labels = comma) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Position of Interest (chrom_start_stop)", y = "Total Count",
fill = "Variant Type") + scale_fill_manual(values=cbPalette))# table view of plot
(HG002_var_counts_summary_df %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))| VOI | num_vars | num_INDEL | num_SNP | num_SNP_INDEL |
|---|---|---|---|---|
| 10_66299841_66299842 | 182 | 36 | 146 | 0 |
| 10_81695836_81695837 | 398 | 46 | 352 | 0 |
| 10_82524373_82524374 | 440 | 34 | 406 | 0 |
| 12_38824344_38824345 | 222 | 36 | 184 | 2 |
| 12_53499604_53499605 | 348 | 68 | 280 | 0 |
| 13_42223740_42223741 | 322 | 40 | 282 | 0 |
| 13_90154843_90154844 | 392 | 92 | 300 | 0 |
| 14_105149647_105149648 | 208 | 24 | 184 | 0 |
| 15_79844900_79844901 | 596 | 78 | 518 | 0 |
| 16_82791081_82791082 | 592 | 52 | 540 | 0 |
| 2_206241116_206241117 | 480 | 62 | 418 | 0 |
| 2_49532029_49532030 | 390 | 62 | 328 | 0 |
| 2_49561482_49561483 | 26 | 8 | 18 | 0 |
| 3_127859142_127859143 | 210 | 48 | 162 | 0 |
| 3_162512133_162512134 | 140 | 30 | 110 | 0 |
| 4_103764056_103764057 | 354 | 64 | 290 | 0 |
| 5_117868224_117868225 | 482 | 54 | 428 | 0 |
| 6_156081224_156081225 | 188 | 28 | 160 | 0 |
| 6_77437124_77437125 | 266 | 28 | 238 | 0 |
| 7_142824835_142824836 | 2 | 2 | 0 | 0 |
| 7_16171773_16171774 | 318 | 34 | 284 | 0 |
| 9_10163173_10163174 | 326 | 42 | 284 | 0 |
Variant Size per POI
Per position of interest (POI) what are the size ranges of variants that exist across the reference benchmarks within 50 kb on either side?
SNPs
########### CREATE ROOT PLOT DF
## create a summary df with the types, sizes, and counts of variants
HG002_snp_variant_plot_df <- HG002_variant_df %>%
filter(var_type == "SNP") %>%
select(VOI, var_type, var_size) %>%
group_by(VOI) %>%
summarise(count = n()) ## count how many of each size per poi
########## PLOT
## organize all poi plots in a gallery view
ggplotly(ggplot(HG002_snp_variant_plot_df) +
geom_bar(aes(x = VOI, y = count),
stat = "identity") +
scale_y_continuous(labels = comma) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Variant Size", y = "Count",
fill = "Variant Type", title = "Number of SNP Variants per POI") + scale_fill_manual(values=cbPalette))#(snp_variant_plot_df %>%
# kbl() %>%
# kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))INDEL
########### CREATE ROOT PLOT DF
## create a summary df with the types, sizes, and counts of variants
HG002_indel_variant_plot_df <- HG002_variant_df %>%
filter(var_type == "INDEL") %>%
select(VOI, var_type, var_size) %>%
group_by(VOI, var_size) %>%
summarise(count = n()) ## count how many of each size per poi
############## SEPARATE DFS
## too many VOIs to render plot nicely - split up and render separately
VOI_lst <- c("2_206241116_206241117", "2_49532029_49532030","2_49561482_49561483",
"3_127859142_127859143","3_162512133_162512134")
HG002_indel_voi_chrom_23 <- HG002_indel_variant_plot_df %>%
filter((VOI %in% VOI_lst))
VOI_lst <- c("4_103764056_103764057", "5_117868224_117868225",
"6_156081224_156081225","6_77437124_77437125")
HG002_indel_voi_chrom_456 <- HG002_indel_variant_plot_df %>%
filter((VOI %in% VOI_lst))
VOI_lst <- c("7_142824835_142824836", "7_16171773_16171774","9_10163173_10163174",
"10_66299841_66299842","10_81695836_81695837", "10_82524373_82524374")
HG002_indel_voi_chrom_7910 <- HG002_indel_variant_plot_df %>%
filter((VOI %in% VOI_lst))
VOI_lst <- c("12_38824344_38824345", "12_53499604_53499605",
"13_42223740_42223741","13_90154843_90154844")
HG002_indel_voi_chrom_1213 <- HG002_indel_variant_plot_df %>%
filter((VOI %in% VOI_lst))
VOI_lst <- c("14_105149647_105149648", "15_79844900_79844901","16_82791081_82791082")
HG002_indel_voi_chrom_141516 <- HG002_indel_variant_plot_df %>%
filter((VOI %in% VOI_lst))
########## PLOT
## organize all poi plots in a gallery view
ggplotly(ggplot(HG002_indel_voi_chrom_23) +
geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
scale_x_continuous(breaks = pretty_breaks(n=10)) +
scale_y_continuous(breaks = pretty_breaks(n=10)) +
facet_wrap(~VOI) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 2 & 3") + scale_fill_manual(values=cbPalette))ggplotly(ggplot(HG002_indel_voi_chrom_456) +
geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
scale_x_continuous(breaks = pretty_breaks(n=10)) +
scale_y_continuous(breaks = pretty_breaks(n=10)) +
facet_wrap(~VOI) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 4, 5, & 6") + scale_fill_manual(values=cbPalette))ggplotly(ggplot(HG002_indel_voi_chrom_7910) +
geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
scale_x_continuous(breaks = pretty_breaks(n=10)) +
scale_y_continuous(breaks = pretty_breaks(n=10)) +
facet_wrap(~VOI) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 7, 0, & 10") + scale_fill_manual(values=cbPalette))ggplotly(ggplot(HG002_indel_voi_chrom_1213) +
geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
scale_x_continuous(breaks = pretty_breaks(n=10)) +
scale_y_continuous(breaks = pretty_breaks(n=10)) +
facet_wrap(~VOI) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 12 & 13") + scale_fill_manual(values=cbPalette))ggplotly(ggplot(HG002_indel_voi_chrom_141516) +
geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
scale_x_continuous(breaks = pretty_breaks(n=10)) +
scale_y_continuous(breaks = pretty_breaks(n=10)) +
facet_wrap(~VOI) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 14, 15, & 16") + scale_fill_manual(values=cbPalette))########## TABLE
#(indel_variant_plot_df %>%
# kbl() %>%
# kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))SNP or INDEL
########### CREATE ROOT PLOT DF
## create a summary df with the types, sizes, and counts of variants
HG002_snp_indel_variant_plot_df <- HG002_variant_df %>%
filter(var_type == "SNP,INDEL") %>%
select(VOI, var_type, var_size) %>%
group_by(VOI, var_size) %>%
summarise(count = n()) ## count how many of each size per poi## `summarise()` has grouped output by 'VOI'. You can override using the `.groups` argument.
########## PLOT
## organize all poi plots in a gallery view
ggplotly(ggplot(HG002_snp_indel_variant_plot_df) +
geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
facet_wrap(~VOI) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Variant Size", y = "Count") + scale_fill_manual(values=cbPalette))(HG002_snp_indel_variant_plot_df %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))| VOI | var_size | count |
|---|---|---|
| 12_38824344_38824345 | 6 | 2 |
HG003
Visualize & Summarize Variants
Prepare Data
Filter data and organize into dataframes for visualization.
# filter to include only HG003
HG003_cergentis_anno_df <- cergentis_anno_df %>%
filter(giab_id == "HG003")
################# CALCULATE VARIANT SIZE
## Calculating INDEL Variant Size
get_var_size <- function(ref, alt){
ref_bp <- nchar(ref)
## Dealing with multiple ALTs
# --- using largest SV
if(str_detect(alt, ",")){
alts <- unlist(str_split(alt, ","))
alts_bp <- nchar(alts)
var_sizes <- alts_bp - ref_bp
var_size <- var_sizes[abs(var_sizes) == max(abs(var_sizes))]
## returning first entry value if the two genotypes are the same size
return(var_size[1])
}
return(nchar(alt) - ref_bp)
}
## Getting variant size into df
HG003_sized_var_df <- HG003_cergentis_anno_df %>%
rowwise() %>%
mutate(var_size = get_var_size(ref, alt))
################## COUNTS VARIANTS PER POI
# count how many variants are within each poi
HG003_num_vars <- HG003_sized_var_df %>%
group_by(VOI, pad_size) %>%
tally() %>%
rename(num_vars = n)
# count how many of each type of variant is within each poi
HG003_type_vars <- HG003_sized_var_df %>%
group_by(VOI, var_type, pad_size) %>%
tally() %>%
pivot_wider(names_from = var_type, values_from = n, values_fill = 0) %>%
rename(num_SNP = SNP, num_INDEL = INDEL, num_SNP_INDEL = 'SNP,INDEL')
# df where variant type counts are summarized
HG003_var_counts_summary_df <- HG003_num_vars %>%
left_join(HG003_type_vars, by = "VOI") %>%
select(-pad_size.x, -pad_size.y)
# the sum of counts of types of vars should = num of vars total
HG003_sanity_check <- HG003_var_counts_summary_df %>%
rowwise() %>%
mutate(total = sum(num_SNP,num_INDEL,num_SNP_INDEL, na.rm = T)) %>%
mutate(check = (num_vars == total)) %>%
filter(check != TRUE) %>%
nrow(.) == 0
# add the var count and types into the var size df
HG003_variant_df <- HG003_sized_var_df %>%
left_join(HG003_num_vars, by = "VOI") %>%
left_join(HG003_type_vars, by = "VOI")%>%
mutate(coord = paste(chrom, pos, sep="_"))
write.table(HG003_variant_df, file = here("results/analysis/HG003_variant_df.tsv"), sep = "\t", row.names = FALSE)Variant Type per POI
Per position of interest (POI) how many of each type of variant exists across the reference benchmarks within 50 kb on either side?
HG003_var_counts_summary_plot <- HG003_var_counts_summary_df %>%
select(-num_vars) %>%
rename(SNP = num_SNP, INDEL = num_INDEL, SNP_INDEL = num_SNP_INDEL) %>%
pivot_longer(names_to = "var_type", values_to = "count",
cols = c("SNP", "INDEL", "SNP_INDEL"))
# color palette definition
cbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# bar plot of counts per variant type
ggplotly(ggplot(HG003_var_counts_summary_plot) +
geom_bar(aes(x = VOI, y = count, fill = var_type),
position = position_dodge(preserve = "single"), stat = "identity") +
scale_y_continuous(labels = comma) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Position of Interest (chrom_start_stop)", y = "Total Count",
fill = "Variant Type") + scale_fill_manual(values=cbPalette))# table view of plot
(HG003_var_counts_summary_df %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))| VOI | num_vars | num_INDEL | num_SNP | num_SNP_INDEL |
|---|---|---|---|---|
| 10_66299841_66299842 | 75 | 15 | 60 | 0 |
| 10_81695836_81695837 | 8 | 4 | 4 | 0 |
| 10_82524373_82524374 | 60 | 5 | 55 | 0 |
| 12_38824344_38824345 | 158 | 27 | 130 | 1 |
| 12_53499604_53499605 | 28 | 6 | 22 | 0 |
| 13_42223740_42223741 | 52 | 6 | 46 | 0 |
| 13_90154843_90154844 | 205 | 41 | 164 | 0 |
| 14_105149647_105149648 | 103 | 17 | 86 | 0 |
| 15_79844900_79844901 | 136 | 15 | 120 | 1 |
| 16_82791081_82791082 | 246 | 23 | 223 | 0 |
| 2_206241116_206241117 | 281 | 32 | 249 | 0 |
| 2_49532029_49532030 | 149 | 25 | 124 | 0 |
| 2_49561482_49561483 | 41 | 8 | 33 | 0 |
| 3_127859142_127859143 | 35 | 9 | 26 | 0 |
| 3_162512133_162512134 | 60 | 13 | 47 | 0 |
| 4_103764056_103764057 | 9 | 2 | 7 | 0 |
| 5_117868224_117868225 | 156 | 17 | 139 | 0 |
| 6_156081224_156081225 | 110 | 20 | 90 | 0 |
| 6_77437124_77437125 | 121 | 9 | 112 | 0 |
| 7_142824835_142824836 | 1 | 1 | 0 | 0 |
| 7_16171773_16171774 | 111 | 14 | 97 | 0 |
| 9_10163173_10163174 | 266 | 39 | 227 | 0 |
Variant Size per POI
Per position of interest (POI) what are the size ranges of variants that exist across the reference benchmarks within 50 kb on either side?
SNPs
########### CREATE ROOT PLOT DF
## create a summary df with the types, sizes, and counts of variants
HG003_snp_variant_plot_df <- HG003_variant_df %>%
filter(var_type == "SNP") %>%
select(VOI, var_type, var_size) %>%
group_by(VOI) %>%
summarise(count = n()) ## count how many of each size per poi
########## PLOT
## organize all poi plots in a gallery view
ggplotly(ggplot(HG003_snp_variant_plot_df) +
geom_bar(aes(x = VOI, y = count),
stat = "identity") +
scale_y_continuous(labels = comma) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Variant Size", y = "Count",
fill = "Variant Type", title = "Number of SNP Variants per POI") + scale_fill_manual(values=cbPalette))#(snp_variant_plot_df %>%
# kbl() %>%
# kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))INDEL
########### CREATE ROOT PLOT DF
## create a summary df with the types, sizes, and counts of variants
HG003_indel_variant_plot_df <- HG003_variant_df %>%
filter(var_type == "INDEL") %>%
select(VOI, var_type, var_size) %>%
group_by(VOI, var_size) %>%
summarise(count = n()) ## count how many of each size per poi
############## SEPARATE DFS
## too many VOIs to render plot nicely - split up and render separately
VOI_lst <- c("2_206241116_206241117", "2_49532029_49532030","2_49561482_49561483",
"3_127859142_127859143","3_162512133_162512134")
HG003_indel_voi_chrom_23 <- HG003_indel_variant_plot_df %>%
filter((VOI %in% VOI_lst))
VOI_lst <- c("4_103764056_103764057", "5_117868224_117868225",
"6_156081224_156081225","6_77437124_77437125")
HG003_indel_voi_chrom_456 <- HG003_indel_variant_plot_df %>%
filter((VOI %in% VOI_lst))
VOI_lst <- c("7_142824835_142824836", "7_16171773_16171774","9_10163173_10163174",
"10_66299841_66299842","10_81695836_81695837", "10_82524373_82524374")
HG003_indel_voi_chrom_7910 <- HG003_indel_variant_plot_df %>%
filter((VOI %in% VOI_lst))
VOI_lst <- c("12_38824344_38824345", "12_53499604_53499605",
"13_42223740_42223741","13_90154843_90154844")
HG003_indel_voi_chrom_1213 <- HG003_indel_variant_plot_df %>%
filter((VOI %in% VOI_lst))
VOI_lst <- c("14_105149647_105149648", "15_79844900_79844901","16_82791081_82791082")
HG003_indel_voi_chrom_141516 <- HG003_indel_variant_plot_df %>%
filter((VOI %in% VOI_lst))
########## PLOT
## organize all poi plots in a gallery view
ggplotly(ggplot(HG003_indel_voi_chrom_23) +
geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
scale_x_continuous(breaks = pretty_breaks(n=10)) +
scale_y_continuous(breaks = pretty_breaks(n=10)) +
facet_wrap(~VOI) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 2 & 3") + scale_fill_manual(values=cbPalette))ggplotly(ggplot(HG003_indel_voi_chrom_456) +
geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
scale_x_continuous(breaks = pretty_breaks(n=10)) +
scale_y_continuous(breaks = pretty_breaks(n=10)) +
facet_wrap(~VOI) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 4, 5, & 6") + scale_fill_manual(values=cbPalette))ggplotly(ggplot(HG003_indel_voi_chrom_7910) +
geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
scale_x_continuous(breaks = pretty_breaks(n=10)) +
scale_y_continuous(breaks = pretty_breaks(n=10)) +
facet_wrap(~VOI) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 7, 0, & 10") + scale_fill_manual(values=cbPalette))ggplotly(ggplot(HG003_indel_voi_chrom_1213) +
geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
scale_x_continuous(breaks = pretty_breaks(n=10)) +
scale_y_continuous(breaks = pretty_breaks(n=10)) +
facet_wrap(~VOI) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 12 & 13") + scale_fill_manual(values=cbPalette))ggplotly(ggplot(HG003_indel_voi_chrom_141516) +
geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
scale_x_continuous(breaks = pretty_breaks(n=10)) +
scale_y_continuous(breaks = pretty_breaks(n=10)) +
facet_wrap(~VOI) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Variant Size", y = "Count", title = "INDEL Variants per POI: chroms 14, 15, & 16") + scale_fill_manual(values=cbPalette))########## TABLE
#(indel_variant_plot_df %>%
# kbl() %>%
# kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))SNP or INDEL
########### CREATE ROOT PLOT DF
## create a summary df with the types, sizes, and counts of variants
HG003_snp_indel_variant_plot_df <- HG003_variant_df %>%
filter(var_type == "SNP,INDEL") %>%
select(VOI, var_type, var_size) %>%
group_by(VOI, var_size) %>%
summarise(count = n()) ## count how many of each size per poi## `summarise()` has grouped output by 'VOI'. You can override using the `.groups` argument.
########## PLOT
## organize all poi plots in a gallery view
ggplotly(ggplot(HG003_snp_indel_variant_plot_df) +
geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
facet_wrap(~VOI) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Variant Size", y = "Count") + scale_fill_manual(values=cbPalette))(HG003_snp_indel_variant_plot_df %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))| VOI | var_size | count |
|---|---|---|
| 12_38824344_38824345 | 6 | 1 |
| 15_79844900_79844901 | 3 | 1 |
Coverage
(cov_df %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", html_font = "Cambria")))| giab_id | ref | benchmark | bench_type | region | chrom | start | end | chrom_start_end | n_regions | bp_cov | gene_size | cov |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 2 | 49482029 | 49582030 | 2_49532029_49532030 | 9 | 79461 | 100001 | 0.7946020 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 2 | 49511482 | 49611483 | 2_49561482_49561483 | 9 | 79625 | 100001 | 0.7962421 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 2 | 206191116 | 206291117 | 2_206241116_206241117 | 15 | 96235 | 100001 | 0.9623404 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 3 | 127809142 | 127909143 | 3_127859142_127859143 | 12 | 97554 | 100001 | 0.9755303 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 3 | 162462133 | 162562134 | 3_162512133_162512134 | 8 | 20624 | 100001 | 0.2062379 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 4 | 103714056 | 103814057 | 4_103764056_103764057 | 24 | 97890 | 100001 | 0.9788902 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 5 | 117818224 | 117918225 | 5_117868224_117868225 | 11 | 88364 | 100001 | 0.8836312 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 6 | 77387124 | 77487125 | 6_77437124_77437125 | 3 | 64357 | 100001 | 0.6435636 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 6 | 156031224 | 156131225 | 6_156081224_156081225 | 16 | 96115 | 100001 | 0.9611404 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 7 | 16121773 | 16221774 | 7_16171773_16171774 | 13 | 94218 | 100001 | 0.9421706 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 7 | 142774835 | 142874836 | 7_142824835_142824836 | 8 | 32058 | 100001 | 0.3205768 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 9 | 10113173 | 10213174 | 9_10163173_10163174 | 17 | 97277 | 100001 | 0.9727603 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 10 | 66249841 | 66349842 | 10_66299841_66299842 | 7 | 98570 | 100001 | 0.9856901 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 10 | 81645836 | 81745837 | 10_81695836_81695837 | 34 | 88776 | 100001 | 0.8877511 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 10 | 82474373 | 82574374 | 10_82524373_82524374 | 14 | 98663 | 100001 | 0.9866201 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 12 | 38774344 | 38874345 | 12_38824344_38824345 | 15 | 86144 | 100001 | 0.8614314 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 12 | 53449604 | 53549605 | 12_53499604_53499605 | 32 | 93009 | 100001 | 0.9300807 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 13 | 42173740 | 42273741 | 13_42223740_42223741 | 11 | 98893 | 100001 | 0.9889201 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 13 | 90104843 | 90204844 | 13_90154843_90154844 | 26 | 96980 | 100001 | 0.9697903 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 14 | 105099647 | 105199648 | 14_105149647_105149648 | 19 | 93893 | 100001 | 0.9389206 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 15 | 79794900 | 79894901 | 15_79844900_79844901 | 19 | 94816 | 100001 | 0.9481505 |
| 50000 | HG002 | GRCh37 | v0.6.2 | SVs-Tier1-noVDJorXorY | 16 | 82741081 | 82841082 | 16_82791081_82791082 | 15 | 97592 | 100001 | 0.9759102 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 2 | 49482029 | 49582030 | 2_49532029_49532030 | 9 | 79461 | 100001 | 0.7946020 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 2 | 49511482 | 49611483 | 2_49561482_49561483 | 9 | 79625 | 100001 | 0.7962421 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 2 | 206191116 | 206291117 | 2_206241116_206241117 | 15 | 96235 | 100001 | 0.9623404 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 3 | 127809142 | 127909143 | 3_127859142_127859143 | 12 | 97554 | 100001 | 0.9755303 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 3 | 162462133 | 162562134 | 3_162512133_162512134 | 8 | 20624 | 100001 | 0.2062379 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 4 | 103714056 | 103814057 | 4_103764056_103764057 | 24 | 97890 | 100001 | 0.9788902 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 5 | 117818224 | 117918225 | 5_117868224_117868225 | 11 | 88364 | 100001 | 0.8836312 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 6 | 77387124 | 77487125 | 6_77437124_77437125 | 3 | 64357 | 100001 | 0.6435636 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 6 | 156031224 | 156131225 | 6_156081224_156081225 | 16 | 96115 | 100001 | 0.9611404 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 7 | 16121773 | 16221774 | 7_16171773_16171774 | 13 | 94218 | 100001 | 0.9421706 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 7 | 142774835 | 142874836 | 7_142824835_142824836 | 8 | 32058 | 100001 | 0.3205768 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 9 | 10113173 | 10213174 | 9_10163173_10163174 | 17 | 97277 | 100001 | 0.9727603 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 10 | 66249841 | 66349842 | 10_66299841_66299842 | 7 | 98570 | 100001 | 0.9856901 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 10 | 81645836 | 81745837 | 10_81695836_81695837 | 34 | 88776 | 100001 | 0.8877511 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 10 | 82474373 | 82574374 | 10_82524373_82524374 | 14 | 98663 | 100001 | 0.9866201 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 12 | 38774344 | 38874345 | 12_38824344_38824345 | 15 | 86144 | 100001 | 0.8614314 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 12 | 53449604 | 53549605 | 12_53499604_53499605 | 32 | 93009 | 100001 | 0.9300807 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 13 | 42173740 | 42273741 | 13_42223740_42223741 | 11 | 98893 | 100001 | 0.9889201 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 13 | 90104843 | 90204844 | 13_90154843_90154844 | 26 | 96980 | 100001 | 0.9697903 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 14 | 105099647 | 105199648 | 14_105149647_105149648 | 19 | 93893 | 100001 | 0.9389206 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 15 | 79794900 | 79894901 | 15_79844900_79844901 | 19 | 94816 | 100001 | 0.9481505 |
| 50000 | HG002 | GRCh37 | v4 | smallvar | 16 | 82741081 | 82841082 | 16_82791081_82791082 | 15 | 97592 | 100001 | 0.9759102 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 2 | 49482029 | 49582030 | 2_49532029_49532030 | 10 | 79445 | 100001 | 0.7944421 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 2 | 49511482 | 49611483 | 2_49561482_49561483 | 11 | 79501 | 100001 | 0.7950020 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 2 | 206191116 | 206291117 | 2_206241116_206241117 | 19 | 95588 | 100001 | 0.9558704 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 3 | 127809142 | 127909143 | 3_127859142_127859143 | 17 | 96973 | 100001 | 0.9697203 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 3 | 162462133 | 162562134 | 3_162512133_162512134 | 7 | 20663 | 100001 | 0.2066279 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 4 | 103714056 | 103814057 | 4_103764056_103764057 | 21 | 98498 | 100001 | 0.9849702 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 5 | 117818224 | 117918225 | 5_117868224_117868225 | 15 | 86373 | 100001 | 0.8637214 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 6 | 77387124 | 77487125 | 6_77437124_77437125 | 4 | 64249 | 100001 | 0.6424836 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 6 | 156031224 | 156131225 | 6_156081224_156081225 | 14 | 96472 | 100001 | 0.9647104 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 7 | 16121773 | 16221774 | 7_16171773_16171774 | 14 | 94044 | 100001 | 0.9404306 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 7 | 142774835 | 142874836 | 7_142824835_142824836 | 7 | 32353 | 100001 | 0.3235268 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 9 | 10113173 | 10213174 | 9_10163173_10163174 | 24 | 92091 | 100001 | 0.9209008 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 10 | 66249841 | 66349842 | 10_66299841_66299842 | 6 | 98730 | 100001 | 0.9872901 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 10 | 81645836 | 81745837 | 10_81695836_81695837 | 43 | 89273 | 100001 | 0.8927211 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 10 | 82474373 | 82574374 | 10_82524373_82524374 | 21 | 98563 | 100001 | 0.9856201 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 12 | 38774344 | 38874345 | 12_38824344_38824345 | 22 | 97372 | 100001 | 0.9737102 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 12 | 53449604 | 53549605 | 12_53499604_53499605 | 35 | 93468 | 100001 | 0.9346706 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 13 | 42173740 | 42273741 | 13_42223740_42223741 | 10 | 99195 | 100001 | 0.9919401 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 13 | 90104843 | 90204844 | 13_90154843_90154844 | 31 | 96018 | 100001 | 0.9601704 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 14 | 105099647 | 105199648 | 14_105149647_105149648 | 26 | 91743 | 100001 | 0.9174208 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 15 | 79794900 | 79894901 | 15_79844900_79844901 | 20 | 94957 | 100001 | 0.9495605 |
| 50000 | HG003 | GRCh37 | v4 | smallvar | 16 | 82741081 | 82841082 | 16_82791081_82791082 | 12 | 97945 | 100001 | 0.9794402 |
Analysis II: Inspection of Final POIs
Samantha went through the positions of interest and determined which 8 were best suited for the study. Those positions are (ref: GRCh37):
| chrom | start position |
|---|---|
| 2 | 49532029 |
| 3 | 127859142 |
| 3 | 162512133 |
| 5 | 117868223 |
| 6 | 77437124 |
| 7 | 142824835 |
| 9 | 10163173 |
| 12 | 38824344 |
Pull all INDELs surrounding these poi’s.
HG002
HG002_pois <- c("2_49532029_49532030", "3_127859142_127859143", "3_162512133_162512134",
"5_117868224_117868225", "6_77437124_77437125", "7_142824835_142824836",
"9_10163173_10163174", "12_38824344_38824345") ## list of final pois
HG002_final_poi_df <- HG002_variant_df %>%
filter(VOI %in% HG002_pois) %>% ## keep only the final pois
filter(!var_type == "SNP") ## keep INDEL and SNP,INDEL types only
HG002_final_poi_plot_df <- HG002_final_poi_df %>%
select(VOI, var_type, var_size) %>%
group_by(VOI, var_size) %>%
summarise(count = n()) ## count how many of each size per poi## `summarise()` has grouped output by 'VOI'. You can override using the `.groups` argument.
########## PLOT
## organize all poi plots in a gallery view
ggplotly(ggplot(HG002_final_poi_plot_df) +
geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
scale_x_continuous(breaks = pretty_breaks(n=10)) +
scale_y_continuous(breaks = pretty_breaks(n=10)) +
facet_wrap(~VOI) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Variant Size", y = "Count", title = "HG002: INDEL Variants per POI: Final POIs") + scale_fill_manual(values=cbPalette)) ## write out final poi df for giab_id
write.table(HG002_final_poi_df, file = here("results/analysis/HG002_final_poi_df.tsv"), sep = "\t", row.names = FALSE)HG003
HG003_pois <- c("2_49532029_49532030", "3_127859142_127859143", "3_162512133_162512134",
"5_117868224_117868225", "6_77437124_77437125", "7_142824835_142824836",
"9_10163173_10163174", "12_38824344_38824345") ## list of final pois
HG003_final_poi_df <- HG003_variant_df %>%
filter(VOI %in% HG003_pois) %>% ## keep only the final pois
filter(!var_type == "SNP") ## keep INDEL and SNP,INDEL types only
HG003_final_poi_plot_df <- HG003_final_poi_df %>%
select(VOI, var_type, var_size) %>%
group_by(VOI, var_size) %>%
summarise(count = n()) ## count how many of each size per poi## `summarise()` has grouped output by 'VOI'. You can override using the `.groups` argument.
########## PLOT
## organize all poi plots in a gallery view
ggplotly(ggplot(HG003_final_poi_plot_df) +
geom_bar(aes(x = var_size, y = count), position = position_dodge(preserve = "single"), stat = "identity") +
scale_x_continuous(breaks = pretty_breaks(n=10)) +
scale_y_continuous(breaks = pretty_breaks(n=10)) +
facet_wrap(~VOI) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Variant Variant Size", y = "Count", title = "HG003: INDEL Variants per POI: Final POIs") + scale_fill_manual(values=cbPalette)) ## write out final poi df for giab_id
write.table(HG003_final_poi_df, file = here("results/analysis/HG003_final_poi_df.tsv"), sep = "\t", row.names = FALSE)